22 research outputs found
Canonical, Stable, General Mapping using Context Schemes
Motivation: Sequence mapping is the cornerstone of modern genomics. However,
most existing sequence mapping algorithms are insufficiently general.
Results: We introduce context schemes: a method that allows the unambiguous
recognition of a reference base in a query sequence by testing the query for
substrings from an algorithmically defined set. Context schemes only map when
there is a unique best mapping, and define this criterion uniformly for all
reference bases. Mappings under context schemes can also be made stable, so
that extension of the query string (e.g. by increasing read length) will not
alter the mapping of previously mapped positions. Context schemes are general
in several senses. They natively support the detection of arbitrary complex,
novel rearrangements relative to the reference. They can scale over orders of
magnitude in query sequence length. Finally, they are trivially extensible to
more complex reference structures, such as graphs, that incorporate additional
variation. We demonstrate empirically the existence of high performance context
schemes, and present efficient context scheme mapping algorithms.
Availability and Implementation: The software test framework created for this
work is available from
https://registry.hub.docker.com/u/adamnovak/sequence-graphs/.
Contact: [email protected]
Supplementary Information: Six supplementary figures and one supplementary
section are available with the online version of this article.
Comment: Submission for Bioinformatic
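The idea of recognizing a reference base by testing the query for identifying substrings can be illustrated with a toy sketch. This is not the authors' context scheme: it assumes a fixed-radius window as the only context, and the function name `unique_context_map` and the `radius` parameter are hypothetical. A query base is mapped only when its surrounding window occurs exactly once in the reference; otherwise it is left unmapped.

```python
def unique_context_map(reference, query, radius=2):
    """Toy mapper (illustrative assumption, not the published scheme).

    Maps query[i] to a reference position only when the window of
    `radius` bases on each side of i occurs exactly once in `reference`.
    Unmapped positions are returned as None.
    """
    width = 2 * radius + 1

    # Index every reference window by the position of its centre base.
    centres = {}
    for start in range(len(reference) - width + 1):
        window = reference[start:start + width]
        centres.setdefault(window, []).append(start + radius)

    mapping = [None] * len(query)
    for i in range(radius, len(query) - radius):
        hits = centres.get(query[i - radius:i + radius + 1], [])
        if len(hits) == 1:  # map only when the context is unambiguous
            mapping[i] = hits[0]
    return mapping
```

Bases near the ends of the query stay unmapped here because they lack a full window. A fixed radius also does not give the stability property described in the abstract.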
An Average-Case Sublinear Exact Li and Stephens Forward Algorithm
Hidden Markov models of haplotype inheritance such as the Li and Stephens model allow for computationally tractable probability calculations using the forward algorithm as long as the representative reference panel used in the model is sufficiently small. Specifically, the monoploid Li and Stephens model and its variants are linear in reference panel size unless heuristic approximations are used. However, sequencing projects numbering in the thousands to hundreds of thousands of individuals are underway, and others numbering in the millions are anticipated.
To make the Li and Stephens forward algorithm for these datasets computationally tractable, we have created a numerically exact version of the algorithm with observed average case O(nk^{0.35}) runtime in number of genetic sites n and reference panel size k. This avoids any tradeoff between runtime and model complexity. We demonstrate that our approach also provides a succinct data structure for general purpose haplotype data storage. We discuss generalizations of our algorithmic techniques to other hidden Markov models.
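For orientation, the classical forward algorithm that this work improves on can be sketched as follows. This is the textbook O(nk) baseline, not the sublinear algorithm described in the abstract. It assumes a constant per-site mismatch probability `mu`, a constant per-site recombination probability `rho` with a uniformly chosen destination haplotype, and a uniform prior over the panel. These parameter names and the function name are illustrative.

```python
def li_stephens_forward(panel, query, mu=0.01, rho=0.05):
    """Classical O(nk) forward algorithm for a haploid copying model.

    panel: list of k haplotypes, each a sequence of n alleles
    query: sequence of n alleles (n >= 1)
    mu:    assumed per-site probability that the copy mismatches
    rho:   assumed per-site probability of switching to a haplotype
           drawn uniformly from the panel
    Returns the probability of `query` given `panel`.
    """
    k = len(panel)

    def emit(j, i):
        # Emission probability of the query allele when copying haplotype j.
        return 1.0 - mu if panel[j][i] == query[i] else mu

    # Uniform prior over which haplotype is copied at the first site.
    forward = [emit(j, 0) / k for j in range(k)]

    for i in range(1, len(query)):
        total = sum(forward)  # shared by all k updates at this site
        forward = [
            emit(j, i) * ((1.0 - rho) * f + rho * total / k)
            for j, f in enumerate(forward)
        ]
    return sum(forward)
```

Each site costs O(k) work, which is the linear dependence on panel size that the abstract refers to. The sketch omits rescaling or log-space arithmetic, so it would underflow on long haplotypes.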
Tools for large and detailed experiments in genomics and tissue development
In this dissertation I present algorithmic and data representation advances in genomics as well as tools for a new bioinformatic approach to mammalian cell culture experiments which I call highly instrumented cell culture. The first section deals with fast variants of the forward algorithm for the Li and Stephens copying model of haplotypes derived from a population. I introduce a direct optimization of the Li and Stephens model forward algorithm which performs the identical calculation, without any approximations, but achieves this in average case sublinear time. This is an improvement over the classical algorithm which is at best linear time. I achieve this by using a sparse representation of the population haplotypes and by introducing an efficient lazy evaluation scheme. I also introduce a generalization of the recombination modeling component of the Li and Stephens model which operates on haplotypes and populations encoded in variation graphs. The second section deals with algebraic representations of genetic sites in variation graphs. I introduce the concept of the bundle, a motif in bidirected graphs which leads to a well defined concept of adjacency of sets of nodes. This allows a granular decomposition of the graph into sites which extends prior work on ultrabubbles and snarls previously reported by Paten et al. Lastly, I introduce the concept of highly instrumented cell culture and some technologies to enable it. I demonstrate a low-cost, robust, arbitrarily scalable microscope array for simultaneous parallel continuous time-series microscopy. I demonstrate new approaches to rapid prototyping of labware and fluidic actuators. I also demonstrate principles and implementation of incubator-free cell culture, which is my approach to cell culture in media containing carbonic acid-carbonate-bicarbonate buffer systems without using any carbon dioxide rich gas chamber. 
I finally describe how these technologies integrate together to enable the creation of highly instrumented, automated, data rich biology experiments.
Modelling haplotypes with respect to reference cohort variation graphs
Motivation: Current statistical models of haplotypes are limited to panels of haplotypes whose genetic variation can be represented by arrays of values at linearly ordered bi- or multiallelic loci. These methods cannot model structural variants or variants that nest or overlap.
Results: A variation graph is a mathematical structure that can encode arbitrarily complex genetic variation. We present the first haplotype model that operates on a variation graph-embedded population reference cohort. We describe an algorithm to calculate the likelihood that a haplotype arose from this cohort through recombinations and demonstrate time complexity linear in haplotype length and sublinear in population size. We furthermore demonstrate a method of rapidly calculating likelihoods for related haplotypes. We describe mathematical extensions to allow modelling of mutations. This work is an important incremental step for clinical genomics and genetic epidemiology since it is the first haplotype model which can represent all sorts of variation in the population.
Availability and implementation: Available on GitHub at https://github.com/yoheirosen/vg
Contact: [email protected]
Supplementary information: Supplementary data are available at Bioinformatics online.
An average-case sublinear forward algorithm for the haploid Li and Stephens model
Background: Hidden Markov models of haplotype inheritance such as the Li and Stephens model allow for computationally tractable probability calculations using the forward algorithm as long as the representative reference panel used in the model is sufficiently small. Specifically, the monoploid Li and Stephens model and its variants are linear in reference panel size unless heuristic approximations are used. However, sequencing projects numbering in the thousands to hundreds of thousands of individuals are underway, and others numbering in the millions are anticipated.
Results: To make the forward algorithm for the haploid Li and Stephens model computationally tractable for these datasets, we have created a numerically exact version of the algorithm with observed average case sublinear runtime with respect to reference panel size k when tested against the 1000 Genomes dataset.
Conclusions: We show a forward algorithm which avoids any tradeoff between runtime and model complexity. Our algorithm makes use of two general strategies which might be applicable to improving the time complexity of other future sequence analysis algorithms: sparse dynamic programming matrices and lazy evaluation.
Superbubbles, Ultrabubbles, and Cacti
A superbubble is a type of directed acyclic subgraph with single distinct source and sink vertices. In genome assembly and genetics, the possible paths through a superbubble can be considered to represent the set of possible sequences at a location in a genome. Bidirected and biedged graphs are a generalization of digraphs that are increasingly being used to more fully represent genome assembly and variation problems. In this study, we define snarls and ultrabubbles, generalizations of superbubbles for bidirected and biedged graphs, and give an efficient algorithm for the detection of these more general structures. Key to this algorithm is the cactus graph, which, we show, encodes the nested decomposition of a graph into snarls and ultrabubbles within its structure. We propose and demonstrate empirically that this decomposition on bidirected and biedged graphs solves a fundamental problem by defining genetic sites for any collection of genomic variations, including complex structural variations, without need for any single reference genome coordinate system. Further, the nesting of the decomposition gives a natural way to describe and model variations contained within large variations, a case not currently dealt with by existing formats [e.g., variant call format (VCF)].
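The statement that the paths through a superbubble represent the possible sequences at a site can be made concrete with a small sketch. It assumes a plain directed acyclic graph stored as an adjacency dictionary with one sequence label per node; it does not handle the bidirected or biedged graphs of the paper and does not detect superbubbles. The function name and data layout are hypothetical.

```python
def allele_sequences(edges, labels, source, sink):
    """Yield the sequence spelled by each source-to-sink path in a DAG.

    edges:  dict mapping a node to a list of successor nodes
    labels: dict mapping a node to the sequence it carries
    The graph is assumed acyclic, so every walk terminates.
    """
    stack = [(source, labels[source])]
    while stack:
        node, seq = stack.pop()
        if node == sink:
            yield seq
            continue
        for nxt in edges.get(node, []):
            # Extend the partial sequence with the successor's label.
            stack.append((nxt, seq + labels[nxt]))
```

For a simple bubble with two alternative middle nodes, the generator yields one sequence per allele. The number of paths can grow exponentially with nesting, so this enumeration is only practical for small sites.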
Picroscope: low-cost system for simultaneous longitudinal biological imaging.
Simultaneous longitudinal imaging across multiple conditions and replicates has been crucial for scientific studies aiming to understand biological processes and disease. Yet, imaging systems capable of accomplishing these tasks are economically unattainable for most academic and teaching laboratories around the world. Here, we propose the Picroscope, which is the first low-cost system for simultaneous longitudinal biological imaging made primarily using off-the-shelf and 3D-printed materials. The Picroscope is compatible with standard 24-well cell culture plates and captures 3D z-stack image data. The Picroscope can be controlled remotely, allowing for automatic imaging with minimal intervention from the investigator. Here, we use this system in a range of applications. We gathered longitudinal whole organism image data for frogs, zebrafish, and planaria worms. We also gathered image data inside an incubator to observe 2D monolayers and 3D mammalian tissue culture models. Using this tool, we can measure the behavior of entire organisms or individual cells over long time periods.